Predicting Reddit post performance using Twitch emotes on Livestreamfail

Michael Puerto (Mowgli)

2021-05-07

Abstract

Twitch.tv is home to millions of internet communities, gamers, role-players, and athletes alike. With the emerging interest in live streaming on the internet, Twitch has become the largest online gaming streaming platform in history. A large component of Twitch’s success is its live chat, which gives users the ability to interact with the broadcaster (streamer) in real time using text and customized emojis, known to the community as emotes. These custom emotes are created by Twitch, broadcasters, or viewers. BetterTTV and FrankerFaceZ are important third-party features that allow these community-generated emotes to be used across Twitch.

Furthermore, Twitch chat and its interaction with the video characteristics of broadcasts have received no attention in the live-streaming literature, since broadcasts can run well beyond five hours. This project aims to build upon recent work on emote-based sentiment and community emote profiles by including the chat interactions from the “Twitch Clip” tool. Twitch clips are user-generated “highlights” that capture important moments of a broadcast and are commonly shared on platforms like YouTube and Reddit. Recently, the Twitch literature has begun characterizing Twitch communities using chat, viewership trends, and content, but no known projects have used resources outside of Twitch to understand how Twitch communities manifest and interact with one another. Existing studies have predominantly used the most-viewed broadcasters with large communities to begin understanding Twitch user behavior. It is known that the size of a community on Twitch plays a large role in how users behave in the chatroom. One way to gain insight into these differences between small and large communities is to analyze Reddit posts that feature Twitch clips. Livestreamfail (LSF) is a subreddit dedicated to sharing livestream- and streamer-related content. LSF is one way smaller streamers get noticed, and it is a platform that can be used to compare big and small communities.

This analysis will investigate features related to post virality (above or below 500 karma) on LSF while building an R package to navigate the Twitch API and return chat information from Twitch clips. Understanding the power of emotes and the influence that LSF has on the Twitch community would benefit growing streamers, company sponsors, and Twitch itself, so that they may grow the community and build products. The results are expected to extend our understanding of Twitch emotes while also qualitatively and quantitatively characterizing Twitch communities to improve Twitch stream recommendation systems. This analysis impacts the Twitch, Reddit, and R programming communities.

Preface

What is a subreddit?

A subreddit can be thought of as a ‘sub-community’ that discusses and shares topics relevant to the sub’s topic, name, or description. For example, the coffee subreddit would be about all things coffee! It would include discussions about a specific roasting technique, coffee bean deals, or tools used to drink coffee, to name a few. The subreddit that I will be investigating in this analysis is Livestreamfail, also known to the community as LSF.

On LSF, you can find users discussing all things related to ‘livestreaming’. The discussions can range from livestreams themselves to the streamers who do the broadcasting.

My previous analysis of LSF indicated that the majority of topics discussed were ‘Twitch Clips’.

In my previous analysis, I examined community emote use for the top 9 streamers present on LSF. This plot represents the karma (points) for the top 6 streamers during 2018 and 2019.

This network graph shows third-party emote use among the top 9 streamers from LSF.

What are Twitch Clips and Emotes?

Twitch Clips

You can compare a Twitch clip to a brief highlight of your favorite TV show with typed comments from the audience. Clips are short, highlighted portions of a live broadcast that feature comments from the live audience.

Notice how, in the chat box on the right, users are chatting ‘live’ during the broadcast. Users differ in their names, colors, and ‘user badges’, which can signify their level of support or their role in the community. Also, notice the lack of words used in the chat box. Most users within this community use emotes instead of words.

Twitch Emotes

What separates ‘live streaming’ from traditional ‘live’ TV is the real-time interaction between the viewers and ‘streamers’. The viewers can chat with the ‘streamer’ using plain text and emojis... or emotes! Just like emojis, emotes are used to express some sort of emotion. Generally, emotes are used to react to what is being said or shown on the stream. But better than emojis, emotes are custom to Twitch and its community.

Some emotes are the same, but not all!

For example, the laughing-crying emoji can be found on most social media platforms and on phones. It is also available on Twitch, but instead of using the laughing-crying emoji, the community opts to use community-designed emotes.

You may notice that these emotes appear to be laughing, but not all emotes are the same. For example, the emote on the left features a man who is laughing, but this emote is used in a different way.

That emote is known as 4Head, and it is used to mock someone or to give useless advice.

Purpose of the project

Understanding the power of emotes and the influence that LSF has on the Twitch community would benefit growing streamers, company sponsors, and Twitch itself, so that they may grow the community and build products.

This project aims to use Twitch emotes to understand and predict Reddit post performance on a Twitch-related subreddit.

Data Collection

Python initial data

The data collection journey began with Python, using the PRAW library to interact with the Reddit API. PRAW stands for the Python Reddit API Wrapper. Once I set up the credentials and a Reddit developer account, I used Python to return a dataset containing 1000 of the newest posts (4-14).
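The collection step could look roughly like the sketch below. It is an assumption about the workflow, not the script that was actually used: the credentials are placeholders supplied by the caller, `submission_to_row` and `fetch_newest` are hypothetical helper names, and the dictionary keys simply mirror the columns of the table that follows.

```python
from datetime import datetime, timezone


def submission_to_row(submission):
    """Flatten one Reddit submission into a dict, one key per column."""
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "Post Title": submission.title,
        "Post Score": submission.score,
        "Created UTC": created.strftime("%Y-%m-%d %H:%M:%S"),
        # author is None when the posting account has been deleted
        "Post Author": str(submission.author) if submission.author else None,
        "domain": submission.domain,
        "url": submission.url,
    }


def fetch_newest(client_id, client_secret, user_agent, limit=1000):
    """Return the newest subreddit posts as a list of row dicts."""
    import praw  # third-party; imported here so the helper above works without it

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    newest = reddit.subreddit("LivestreamFail").new(limit=limit)
    return [submission_to_row(s) for s in newest]
```

Keeping the row conversion separate from the API call means the flattening logic can be checked without credentials or network access.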

Post Title | Post Score | Created UTC | Post Author | domain | url
Twitch bans the word obese for predictions | 24543 | 2021-04-10 15:47:27 | kingpussay | clips.twitch.tv | https://clips.twitch.tv/CarelessBlatantNoodleTebowing-H7VBqqNa25gowTSU
After a month of nearly constant streaming, Ludwig’s subathon ends | 22763 | 2021-04-14 03:59:38 | xerxestrux | clips.twitch.tv | https://clips.twitch.tv/RenownedExcitedSangPartyTime-K_bahdAs28WWSdR7
Cadian with the play of a lifetime to win grand final csgo pro league | 22754 | 2021-04-11 20:48:07 | Areax | clips.twitch.tv | https://clips.twitch.tv/GrossKnottyToothShazBotstix-CkHHYXu6bI1LWCWt
Cornwood takes X phone | 15575 | 2021-04-09 16:41:53 | Intrilo | clips.twitch.tv | https://clips.twitch.tv/AmazingTransparentTigerLitty-wYqeVF9FuC4cuuzN
Korean streamer finds a clip he likes | 13337 | 2021-04-11 08:52:04 | PantsOfAwesome | clips.twitch.tv | https://clips.twitch.tv/CheerfulSmellyHerdOSkomodo-Zn-Y1kC9ecSchP1y
xQc gets NVL’d | 12235 | 2021-04-13 22:41:05 | SmileySY_ | clips.twitch.tv | https://clips.twitch.tv/NimbleProudBeePanicVis-JnqzmPsUt0RwBn7z

The post.score indicates how ‘successful/viral’ a post is. In addition to post.author, post.date, and other features that can indicate how well a post will do, I will follow the post.url back to the relevant Twitch clip and add its data to the larger dataset.

As a running example, I will use the clip of Ludwig (a popular streamer) ending his subathon.

X1 | Post Title | Post Score | Created UTC | Post Author | domain | url
74 | After a month of nearly constant streaming, Ludwig’s subathon ends | 22763 | 2021-04-14 03:59:38 | xerxestrux | clips.twitch.tv | https://clips.twitch.tv/RenownedExcitedSangPartyTime-K_bahdAs28WWSdR7

Rchamp and Chat Summary

Continuing with the previous Ludwig clip.

## [1] "Got the clip data B)"
## [1] 10.04615
## [1] 30.31923
## [1] 49.53846
## [1] 65.04615
## [1] 84.51923
## [1] 99.98846
## [1] "Downloading 100% complete"
## [1] 100

This Twitch clip features 335 users and 335 messages.

In order to preserve one observation per Reddit post, I elected to take a bag-of-words-like approach by summarizing key emote representations for each Twitch clip. For the top 4 emotes in the Twitch clip, I wanted to know the name of the emote (Sentiment), the number of times it occurs (Size of audience), and its proportion of all emotes in the chat (Chat Congruency).
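The summary described above can be sketched in a few lines. This is a minimal illustration rather than the original R code: it assumes a message is split on whitespace and that a set of known emote names (`known_emotes`, a hypothetical input) is available to tell emotes apart from ordinary words.

```python
from collections import Counter


def summarise_emotes(messages, known_emotes, top_n=4):
    """Return (emote, count, percent of all emote uses) for the top_n emotes."""
    counts = Counter(
        token
        for message in messages
        for token in message.split()
        if token in known_emotes
    )
    total = sum(counts.values())
    # most_common() is empty when no emotes were found, so no division by zero
    return [
        (emote, n, round(100 * n / total, 2))
        for emote, n in counts.most_common(top_n)
    ]


chat = ["KEKW KEKW hello", "Pog", "KEKW"]
print(summarise_emotes(chat, known_emotes={"KEKW", "Pog"}))
# [('KEKW', 3, 75.0), ('Pog', 1, 25.0)]
```

Each tuple maps onto one emote, count, and proportion column group in the final dataset, with the proportion expressed as a percentage of all emote uses in the clip.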

Twitch Clip

Using Rchamp, we can extract the raw data from the twitch clip. See the Rchamp portion of the article for more information on Rchamp.

row | message | user | badges | badge_version | global_emotes
60 | ludwig7 ludwigStar ludwig7 ludwigStar ludwig7 ludwigStar | BigGucciD_ | vip | 1 | ludwig7 ludwigStar
61 | Jedijed419 gifted a Tier 1 sub to Xehao! | Jedijed419 | subscriber | 0 | No Emotes
62 | shawn6991 subscribed at Tier 1. | shawn6991 | subscriber | 0 | No Emotes
63 | caramelmochaaa subscribed at Tier 1. | caramelmochaaa | subscriber | 0 | No Emotes
64 | Cjthegamer37 is gifting 1 Tier 1 Subs to ludwig’s community! They’ve gifted a total of 1 in the channel! | Cjthegamer37 | subscriber | 0 | No Emotes
65 | Jett_Fighter is gifting 5 Tier 1 Subs to ludwig’s community! They’ve gifted a total of 111 in the channel! | Jett_Fighter | subscriber | 3 | No Emotes

Sample Final Dataset

Post Title | Post Score
After a month of nearly constant streaming, Ludwig’s subathon ends | 22763

url | first_emote | first_emote_count | first_emote_proportion | second_emote | second_emote_count | second_emote_proportion | third_emote | third_emote_count | third_emote_proportion | fourth_emote | fourth_emote_count
https://clips.twitch.tv/RenownedExcitedSangPartyTime-K_bahdAs28WWSdR7 | ludwig7 | 148 | 73.27 | BibleThump | 12 | 5.94 | HypeOni2 | 11 | 5.45 | FeelsStrongMan | 6

Actual Final Dataset

X1 | Post Title | Post Score | Created UTC | Post Author | domain | url | first_emote | first_emote_count | first_emote_proportion | second_emote | second_emote_count | second_emote_proportion | third_emote | third_emote_count | third_emote_proportion | fourth_emote | fourth_emote_count | fourth_emote_proportion
2 | Chang Gang OOC talking about what the chat hoppers are talking about and calling XQC “dogshit” | 6210 | 2021-03-23 05:43:58 | Xenoleff | clips.twitch.tv | https://clips.twitch.tv/DrabHealthyDiamondUWot-sG2Ym1Gjb6th2TdG | rameeEZ | 65 | 19.35 | UHM | 60 | 17.86 | OuttaPocket | 44 | 13.10 | LUL | 41 | 12.20
20 | Happy birthday from the news to Erobb | 633 | 2021-04-14 18:58:07 | EchoOk4335 | clips.twitch.tv | https://clips.twitch.tv/SparklyArtsyPlumPeteZaroll-CwObrnd2t4tbY5Xu | OMEGALUL | 66 | 39.52 | PepeLaugh | 46 | 27.54 | WutFace | 19 | 11.38 | emoneyLW | 4 | 2.40
33 | LIRIK decided to keep playing Bayvon | 1138 | 2021-04-14 16:30:35 | DzejBee | clips.twitch.tv | https://clips.twitch.tv/IronicPlacidTardigradeSoBayed-j72w5GfJDPxI-1IG | Pog | 108 | 72.48 | Clap | 8 | 5.37 | lirikH | 6 | 4.03 | PogU | 6 | 4.03
35 | Malena spout the forbidden word | 684 | 2021-04-14 15:54:17 | kingpussay | clips.twitch.tv | https://clips.twitch.tv/AlluringCulturedBeefYouDontSay-j0xT9jFICvcY06MD | Clap | 51 | 20.48 | Sadge | 36 | 14.46 | PogU | 29 | 11.65 | EZ | 23 | 9.24
39 | Heokong aka ‘King Coomer’ loses his girl to european league of legends player | 2626 | 2021-04-14 15:19:42 | Halfhander | clips.twitch.tv | https://clips.twitch.tv/FuriousBashfulCiderThunBeast-INwYofndhRYNdCss | KEKW | 104 | 44.26 | BOOBA | 27 | 11.49 | Pog | 27 | 11.49 | EZ | 11 | 4.68
47 | 39Daph calls out NymN | 669 | 2021-04-14 13:58:16 | AS43_ | clips.twitch.tv | https://clips.twitch.tv/CleanHedonisticEyeballSpicyBoy-sNvgti7NCR3DCLBf | OMEGALUL | 13 | 46.43 | KEKW | 4 | 14.29 | LULW | 3 | 10.71 | lacOMEGA | 2 | 7.14

More Information: Rchamp

The package is available to view and use from github: https://github.com/mowgl-i/Rchamp

This package works by using the ‘slug’ of a Twitch clip! The slug is the end portion of the Twitch clip URL.

For example, the Ludwig clip we have been using looks like this -> “https://clips.twitch.tv/RenownedExcitedSangPartyTime-K_bahdAs28WWSdR7”

The slug for it is -> “RenownedExcitedSangPartyTime-K_bahdAs28WWSdR7”
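Pulling the slug out of a post URL is a one-liner in most languages. The Python sketch below is only an illustration of the idea (Rchamp itself is an R package, and `clip_slug` is a hypothetical name); it takes the last path segment and ignores surrounding whitespace, a trailing slash, and any query string.

```python
from urllib.parse import urlparse


def clip_slug(url):
    """Return the last path segment of a clip URL, which is the slug."""
    path = urlparse(url.strip()).path
    return path.rstrip("/").split("/")[-1]


print(clip_slug("https://clips.twitch.tv/RenownedExcitedSangPartyTime-K_bahdAs28WWSdR7"))
# RenownedExcitedSangPartyTime-K_bahdAs28WWSdR7
```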

Using this information, the preset functions within the package retrieve the details of the Twitch clip and store the timestamp of the clip within the original broadcast, along with the clip’s total length.

Using that information, we can ask the API to return chat messages within that time window using a series of requests until the clip length is reached.

After that, you would have a dataframe of all the chat messages for that clip.
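The request loop described above can be sketched as follows. This is not Rchamp’s implementation: `fetch_page` is a hypothetical callable standing in for whatever API request is used, and its contract (start from an offset in seconds on the first call, then follow a cursor, returning messages in time order plus the next cursor or `None`) is an assumption made for the example.

```python
def collect_clip_chat(fetch_page, clip_start, clip_length):
    """Collect chat messages between clip_start and clip_start + clip_length.

    fetch_page(offset, cursor) is assumed to return (messages, next_cursor),
    where each message is a dict with an 'offset' key in seconds from the
    start of the broadcast.
    """
    clip_end = clip_start + clip_length
    collected = []
    messages, cursor = fetch_page(offset=clip_start, cursor=None)
    while True:
        for message in messages:
            if message["offset"] > clip_end:
                return collected  # walked past the end of the clip
            if message["offset"] >= clip_start:
                collected.append(message)
        if cursor is None:
            return collected  # the broadcast has no more chat pages
        messages, cursor = fetch_page(offset=None, cursor=cursor)
```

The list of dicts returned here plays the role of the dataframe of chat messages: one entry per message, ready to be summarized per clip.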

To summarize the clip data and attach it to the Reddit posts, I’ve used a series of for loops to walk through a list of Twitch clips, download them, and attach the results to another dataframe.
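A loop of that shape might look like the sketch below. It is illustrative only: `summarise_clip` is a hypothetical callable that downloads one clip’s chat and returns its summary columns as a dict, and skipping clips that fail to download is an assumption, not something the original loops are known to do.

```python
def attach_clip_summaries(posts, summarise_clip):
    """Return a copy of each post dict extended with its clip summary."""
    enriched = []
    for post in posts:
        try:
            summary = summarise_clip(post["url"])
        except Exception as error:  # e.g. a deleted clip or a failed request
            print(f"Skipping {post['url']}: {error}")
            continue
        # one row per Reddit post: post columns first, then the emote columns
        enriched.append({**post, **summary})
    return enriched
```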

Machine Learning

Set up

Since I have all the data I need to begin answering my original question, we can finally use machine learning to answer it.

For this task, I will be using a supervised machine learning model which ‘learns’ from training data and is evaluated on testing data.

For example, from our Reddit dataset of 756 observations, we will use 70% to let the model learn the patterns and then test what it has learned on the remaining 30% of the data, which it has never seen before.
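As a rough illustration of the 70/30 split, here is a minimal sketch using only the Python standard library. The seed and the function name are arbitrary choices for the example, and the split is a plain random shuffle; the original analysis may have used a stratified split instead.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and cut them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(756))
print(len(train), len(test))
# 529 227
```

With 756 observations, the model would learn from 529 posts and be evaluated on the remaining 227.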

A popular algorithm to use at the beginning of model development is the Random Forest model. The Random Forest model is popular because of its robustness against outliers and its ability to manage over-fitting. Though its imputation of missing values (mode) is not the best in my opinion, its bootstrap resampling makes it robust against outliers. One issue with Random Forests is that they can be slow to train.

The data was separated into ‘Successful’ (more than 500 karma) and ‘Not-Successful’ (less than 500 karma) categories based on the score the post received.
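The labeling rule is simple enough to state as code. In this sketch, a score of exactly 500 is treated as ‘Not-Successful’; that tie-breaking choice is an assumption, since the rule above only covers scores above and below 500.

```python
from collections import Counter

KARMA_THRESHOLD = 500


def label_post(score, threshold=KARMA_THRESHOLD):
    """Map a post score to one of the two classes used in the analysis."""
    # Assumption: a score of exactly 500 counts as 'Not-Successful'.
    return "Successful" if score > threshold else "Not-Successful"


def class_counts(scores):
    """Count how many posts fall into each class."""
    return Counter(label_post(score) for score in scores)
```

Checking `class_counts` on the full dataset before modeling shows how balanced the two classes are, which matters when reading accuracy later on.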


Since this analysis has not been done before, I decided to train a Random Forest model for its simplicity and easy-to-understand implementation. I applied hyper-parameter tuning to this model to get the best possible classification results, as measured by Area Under the Curve (AUC).
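To make the tuning metric concrete, the sketch below computes AUC directly from its ranking interpretation: the share of (successful, not-successful) pairs in which the successful post received the higher predicted probability. It is a plain-Python illustration, not the routine used in the analysis, and it assumes ‘Successful’ is coded as 1.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    labels: 1 for 'Successful', 0 for 'Not-Successful'
    scores: predicted probability of 'Successful'
    A tie between a positive and a negative counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one post from each class")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
# 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so tuning amounts to picking the parameter combination whose predictions rank held-out posts best.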


Results

The best model that can be derived from this data using RF produces an accuracy of 69.9% and an AUC (Area Under the Curve) score of 59.8%. Sensitivity is 0.94 and specificity is 0.07. This suggests that the model is good at predicting which posts are successful (viral), while being not so good at identifying posts that are not successful.
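The reported figures can be cross-checked with a little arithmetic. Accuracy is a weighted average of sensitivity and specificity, weighted by the share of each class, so the three numbers together imply the class balance of the test set. The sketch below assumes that ‘Successful’ is the positive class and treats the rounded values above as exact; the function name is illustrative.

```python
def implied_positive_share(accuracy, sensitivity, specificity):
    """Solve accuracy = sens * p + spec * (1 - p) for p, the positive-class share."""
    return (accuracy - specificity) / (sensitivity - specificity)


share = implied_positive_share(accuracy=0.699, sensitivity=0.94, specificity=0.07)
print(f"Implied share of successful posts: {share:.1%}")
# Implied share of successful posts: 72.3%
```

Under those assumptions, roughly 72% of the test posts are successful, so a model that always predicts ‘Successful’ would reach about 72% accuracy. That is consistent with the low specificity: most of the accuracy comes from the majority class.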

This plot indicates the AUC for possible tuned parameters mtry and min_n for the Random Forest model.

This plot is a variable importance plot. It indicates which variables/features were most important in predicting Reddit virality, measured by the decrease in model error.

There are 2 things that I glean from the variable importance plot.

The first thing is that the frequency and congruency of the chat are quite important, as they indicate how active the chat is and how ‘similar’ or herd-like it is.

The second thing is that the emote OMEGALUL shows up. One thing that I aim to do is to move away from raw emote names and to score the Twitch clips using an emote-sentiment lexicon. In this model, the variable importance plot indicates that ‘Funny’ posts are key in identifying posts that go ‘viral’.
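Scoring a clip with an emote-sentiment lexicon could work roughly as sketched below: each emote’s sentiment value is weighted by how often it appears in the clip. The lexicon values here are made-up placeholders for illustration and are not taken from any real lexicon; the function name is likewise hypothetical.

```python
def clip_sentiment(emote_counts, lexicon):
    """Count-weighted mean sentiment of the emotes that appear in the lexicon."""
    scored = {e: n for e, n in emote_counts.items() if e in lexicon}
    total = sum(scored.values())
    if total == 0:
        return None  # no emote in this clip has a known sentiment
    return sum(lexicon[e] * n for e, n in scored.items()) / total


# Placeholder weights for illustration only, not from any real lexicon.
placeholder_lexicon = {"OMEGALUL": 1.0, "KEKW": 1.0, "Sadge": -1.0}

print(clip_sentiment({"KEKW": 3, "Sadge": 1, "Pog": 5}, placeholder_lexicon))
# 0.5
```

A single score like this would replace the raw emote-name columns, so clips from communities with different custom emotes become directly comparable.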

Conclusions

A very close EDA of the data (along with the sensitivity/specificity) shows that there are more than a few posts that contain lots of interactions and emotes but did not gain traction on LSF.

Emotes alone are not enough! More data is needed about Twitch clips in order to understand why posts go viral.

Some ideas are to include broadcaster, total followers of the broadcaster, total views of the clip, emote + word sentiment scores, and reddit post author.

Different modeling techniques, both supervised and unsupervised, should be tried. For example, I will explore a clustering algorithm to cluster ‘similar’ emotes and attribute sentiment values to the clusters. Using this information would allow other researchers to further understand the nature of Twitch clips.

Everything used in this analysis can be found in this github repo: https://github.com/mowgl-i/LSF-Success

The Rchamp package can be found here: https://github.com/mowgl-i/Rchamp

This project serves the R community, Reddit community and the Twitch community.

The goal is to understand the power of emotes and the influence that LSF has on the Twitch community, for the benefit of growing streamers, company sponsors, and Twitch itself, so that they may grow the community and build products.

Next Steps

1: Report my findings to folks over at Livestreamfail!

2: Gather more LSF data and engineer more Twitch variables

3: Continue developing Rchamp

4: Submit package to JOSS (Journal of Open Source Software)

5: Develop Rchamp web application for initial clip insights.